Abstract

In the efficient global optimization problem, we minimize an unknown
function $f$, using as few observations $f(x)$ as possible. It can be
considered a continuum-armed-bandit problem, with noiseless data and
simple regret. Expected-improvement algorithms are perhaps the most
popular methods for solving the problem; in this paper, we provide
theoretical results on their asymptotic behaviour.

Implementing these algorithms requires a choice of Gaussian-process 
prior, which determines an associated space of functions, its reproducing- 
kernel Hilbert space (RKHS). When the prior is fixed, expected im- 
provement is known to converge on the minimum of any function in 
its RKHS. We provide convergence rates for this procedure, optimal 
for functions of low smoothness, and describe a modified algorithm 
attaining optimal rates for smoother functions. 

In practice, however, priors are typically estimated sequentially 
from the data. For standard estimators, we show this procedure may 
never find the minimum of $f$. We then propose alternative estimators,
chosen to minimize the constants in the rate of convergence, and show 
these estimators retain the convergence rates of a fixed prior. 

1 Introduction 

Suppose we wish to minimize a continuous function $f : X \to \mathbb{R}$,
where $X$ is a compact subset of $\mathbb{R}^d$. Observing $f(x)$ is costly
(it may require a lengthy computer simulation or physical experiment), so
we wish to use as few observations as possible. We know little about the
shape of $f$; in particular, we will be unable to make assumptions of
convexity or unimodality. We therefore need a global optimization
algorithm, one which attempts to find a global minimum.

Many standard global optimization algorithms exist, including genetic 
algorithms, multistart, and simulated annealing (Pardalos and Romeijn, 
2002), but these algorithms are designed for functions that are cheap to 
evaluate. When $f$ is expensive, we need an efficient algorithm, one which
will choose its observations to maximize the information gained. 

We can consider this a continuum-armed-bandit problem (Srinivas et al., 
2010, and references therein), with noiseless data, and loss measured by the 
simple regret (Bubeck et al., 2009). At time $n$, we choose a design point
$x_n \in X$, make an observation $z_n = f(x_n)$, and then report a point
$x_n^*$ where we believe $f(x_n^*)$ will be low. Our goal is to find a
strategy for choosing the $x_n$ and $x_n^*$, in terms of previous
observations, so as to minimize $f(x_n^*)$.

We would like to find a strategy which can guarantee convergence: for
functions $f$ in some smoothness class, $f(x_n^*)$ should tend to
$\min f$, preferably at some fast rate. The simplest method would be to
fix a sequence of $x_n$ in advance, and set $x_n^* = \arg\min \hat f_n$,
for some approximation $\hat f_n$ to $f$. We will show that if $\hat f_n$
converges in supremum norm at the optimal rate, then $f(x_n^*)$ also
converges at its optimal rate. However, while this strategy gives a good
worst-case bound, on average it is clearly a poor method of optimization:
the design points $x_n$ are completely independent of the observations
$z_n$.

We may therefore ask if there are more efficient methods, with bet- 
ter average-case performance, that nevertheless provide good guarantees of 
convergence. The difficulty in designing such a method lies in the trade-off 
between exploration and exploitation. If we exploit the data, observing in 
regions where $f$ is known to be low, we will be more likely to find the
optimum quickly; however, unless we explore every region of $X$, we may not
find it at all (Macready and Wolpert, 1998). 

Initial attempts at this problem include work on Lipschitz optimization 
(summarized in Hansen et al., 1992) and the DIRECT algorithm (Jones 
et al., 1993), but perhaps the best-known strategy is expected improvement. 
It is sometimes called Bayesian optimization, and first appeared in Mockus 
(1974) as a Bayesian decision-theoretic solution to the problem. Contem- 
porary computers were not powerful enough to implement the technique in 
full, and it was later popularized by Jones et al. (1998), who provided a com- 
putationally efficient implementation. More recently, it has also been called 
a knowledge-gradient policy by Frazier et al. (2009). Many extensions and
alterations have been suggested by further authors; a good summary can be 
found in Brochu et al. (2010). 

Expected improvement performs well in experiments (Osborne, 2010, 
§9.5), but little is known about its theoretical properties. The behaviour 
of the algorithm depends crucially on the Gaussian process prior $\pi$
chosen for $f$. Each prior has an associated space of functions
$\mathcal{H}$, its reproducing-kernel Hilbert space. $\mathcal{H}$
contains all functions $X \to \mathbb{R}$ as smooth as a posterior mean of
$f$, and is the natural space in which to study questions of convergence.

Vazquez and Bect (2010) show that when $\pi$ is a fixed Gaussian process
prior of finite smoothness, expected improvement converges on the minimum
of any $f \in \mathcal{H}$, and almost surely for $f$ drawn from $\pi$.
Grünewälder et al. (2010) bound the convergence rate of a computationally
infeasible version of expected improvement: for priors $\pi$ of smoothness
$\nu$, they show convergence at a rate $O^*(n^{-(\nu \wedge 0.5)/d})$ on
$f$ drawn from $\pi$. We begin by bounding the convergence rate of the
feasible algorithm, and show convergence at a rate
$O^*(n^{-(\nu \wedge 1)/d})$ on all $f \in \mathcal{H}$. We go on to show
that a modification of expected improvement converges at the near-optimal
rate $O^*(n^{-\nu/d})$.

For practitioners, however, these results are somewhat misleading. In 
typical applications, the prior is not held fixed, but depends on parameters 
estimated sequentially from the data. This process ensures the choice of 
observations is invariant under translation and scaling of $f$, and is believed
to be more efficient (Jones et al., 1998, §2). It has a profound effect on 
convergence, however: Locatelli (1997, §3.2) shows that, for a Brownian 
motion prior with estimated parameters, expected improvement may not 
converge at all. 

We extend this result to more general settings, showing that for standard 
priors with estimated parameters, there exist smooth functions $f$ on which
expected improvement does not converge. We then propose alternative es- 
timates of the prior parameters, chosen to minimize the constants in the 
convergence rate. We show that these estimators give an automatic choice 
of parameters, while retaining the convergence rates of a fixed prior. 

Table 1 summarizes the notation used in this paper. We say
$f : \mathbb{R}^d \to \mathbb{R}$ is a bump function if $f$ is infinitely
differentiable and of compact support, and
$f : \mathbb{R}^d \to \mathbb{C}$ is Hermitian if
$\overline{f(x)} = f(-x)$. We use the Landau notation $f = O(g)$ to denote
$\limsup |f/g| < \infty$, and $f = o(g)$ to denote $f/g \to 0$. If
$g = O(f)$, we say $f = \Omega(g)$, and if both $f = O(g)$ and
$f = \Omega(g)$, we say $f = \Theta(g)$. If further $f/g \to 1$, we say
$f \sim g$. Finally, if $f$ and $g$ are random, and
$P(\sup |f/g| < M) \to 1$ as $M \to \infty$, we say $f = O_p(g)$.

In Section 2, we briefly describe the expected-improvement algorithm, 
and detail our assumptions on the priors used. We state our main results 
in Section 3, and discuss implications for further work in Section 4. Finally, 
we give proofs in Appendix A. 

2 Expected Improvement 

Suppose we wish to minimize an unknown function $f$, choosing design points
$x_n$ and estimated minima $x_n^*$ as in the introduction. If we pick a
prior distribution $\pi$ for $f$, representing our beliefs about the
unknown function, we can describe this problem in terms of decision theory.
Let $(\Omega, \mathcal{F}, P)$ be a probability space, equipped with a
random process $f$ having law $\pi$. A strategy $u$ is a collection of
random variables $(x_n)$, $(x_n^*)$ taking values in $X$. Set
$z_n := f(x_n)$, and define the filtration
$\mathcal{F}_n := \sigma(x_i, z_i : i \le n)$. The strategy $u$ is valid if
$x_n$ is conditionally independent of $f$ given $\mathcal{F}_{n-1}$, and
likewise $x_n^*$ given $\mathcal{F}_n$. (Note that we allow random
strategies, provided they do not depend on unknown information about $f$.)

Section 1
  $f$                        unknown function $X \to \mathbb{R}$ to be minimized
  $X$                        compact subset of $\mathbb{R}^d$ to minimize over
  $d$                        number of dimensions to minimize over
  $x_n$                      points in $X$ at which $f$ is observed
  $z_n$                      observations $z_n = f(x_n)$ of $f$
  $x_n^*$                    estimated minimum of $f$, given $z_1, \dots, z_n$
Section 2.1
  $\pi$                      prior distribution for $f$
  $u$                        strategy for choosing $x_n$, $x_n^*$
  $\mathcal{F}_n$            filtration $\mathcal{F}_n = \sigma(x_i, z_i : i \le n)$
  $z_n^*$                    best observation $z_n^* = \min_{i=1,\dots,n} z_i$
  $EI_n$                     expected improvement given $\mathcal{F}_n$
Section 2.2
  $\mu$, $\sigma^2$          global mean and variance of Gaussian-process prior $\pi$
  $K$                        underlying correlation kernel for $\pi$
  $K_\theta$                 correlation kernel for $\pi$ with length-scales $\theta$
  $\nu$, $\alpha$            smoothness parameters of $K$
  $\hat\mu_n$, $\hat f_n$, $s_n$, $\hat R_n$
                             quantities describing posterior distribution of $f$ given $\mathcal{F}_n$
Section 2.3
  $EI(\pi)$                  expected improvement strategy with fixed prior
  $\hat\sigma_n^2$, $\hat\theta_n$
                             estimates of prior parameters $\sigma^2$, $\theta$
  $c_n$                      rate of decay of $\hat\sigma_n^2$
  $\theta^L$, $\theta^U$     bounds on $\hat\theta_n$
  $EI(\hat\pi)$              expected improvement strategy with estimated prior
Section 3.1
  $\mathcal{H}_\theta(S)$    reproducing-kernel Hilbert space of $K_\theta$ on $S$
  $H^s(D)$                   Sobolev Hilbert space of order $s$ on $D$
Section 3.2
  $L_n$                      loss suffered over an RKHS ball after $n$ steps
Section 3.3
  $EI(\tilde\pi)$            expected improvement strategy with robust estimated prior
Section 3.4
  $EI(\cdot, \varepsilon)$   $\varepsilon$-greedy expected improvement strategies

Table 1: Notation

When taking probabilities and expectations we will write $P_\pi^u$ and
$E_\pi^u$, denoting the dependence on both the prior $\pi$ and strategy
$u$. The average-case performance at some future time $n$ is then given by
the expected loss,

$$E_\pi^u[f(x_n^*) - \min f],$$

and our goal, given $\pi$, is to choose the strategy $u$ to minimize this
quantity.

2.1 Bayesian Optimization 

For $n > 1$ this problem is very computationally intensive (Osborne, 2010,
§6.3), but we can solve a simplified version of it. First, we restrict the
choice of $x_n^*$ to the previous design points $x_1, \dots, x_n$. (In
practice this is reasonable, as choosing an $x_n^*$ we have not observed
can be unreliable.) Secondly, rather
than finding an optimal strategy for the problem, we derive the myopic 
strategy: the strategy which is optimal if we always assume we will stop after 
the next observation. This strategy is suboptimal (Ginsbourger et al., 2008, 
§3.1), but performs well, and greatly simplifies the calculations involved. 

In this setting, given $\mathcal{F}_n$, if we are to stop at time $n$ we
should choose $x_n^* := x_{i_n^*}$, where
$i_n^* := \arg\min_{i=1,\dots,n} z_i$. (In the case of ties, we may pick
any minimizing $i_n^*$.) We then suffer a loss $z_n^* - \min f$, where
$z_n^* := z_{i_n^*}$. Were we to observe at $x_{n+1}$ before stopping, the
expected loss would be

$$E_\pi^u[z_{n+1}^* - \min f \mid \mathcal{F}_n],$$

so the myopic strategy should choose $x_{n+1}$ to minimize this quantity.
Equivalently, it should maximize the expected improvement over the current
loss,

$$EI_n(x_{n+1}; \pi) := E_\pi^u[z_n^* - z_{n+1}^* \mid \mathcal{F}_n] = E_\pi^u[(z_n^* - z_{n+1})^+ \mid \mathcal{F}_n],$$  (1)

where $x^+ = \max(x, 0)$.

So far, we have merely replaced one optimization problem with another.
However, for suitable priors, $EI_n$ can be evaluated cheaply, and thus
maximized by standard techniques. The expected-improvement algorithm is
then given by choosing $x_{n+1}$ to maximize (1).

2.2 Gaussian Process Models 

We still need to choose a prior $\pi$ for $f$. Typically, we model $f$ as
a stationary Gaussian process: we consider the values $f(x)$ to be jointly
Gaussian, with mean and covariance

$$E_\pi[f(x)] = \mu, \qquad \mathrm{Cov}_\pi[f(x), f(y)] = \sigma^2 K_\theta(x - y).$$  (2)

$\mu \in \mathbb{R}$ is the global mean of $f$; we place a flat prior on
$\mu$, reflecting our uncertainty over the location of $f$.

$\sigma > 0$ is the global scale of variation of $f$, and
$K_\theta : \mathbb{R}^d \to \mathbb{R}$ its correlation kernel, governing
the local properties of $f$. In the following, we will consider kernels

$$K_\theta(t_1, \dots, t_d) := K(t_1/\theta_1, \dots, t_d/\theta_d),$$  (3)

for an underlying kernel $K$ with $K(0) = 1$. (Note that we can always
satisfy this condition by suitably scaling $K$ and $\sigma$.) The
$\theta_i > 0$ are the length-scales of the process: two values $f(x)$ and
$f(y)$ will be highly correlated if each $x_i - y_i$ is small compared with
$\theta_i$. For now, we will assume the parameters $\sigma$ and $\theta$
are fixed in advance.

For (2) and (3) to define a consistent Gaussian process, K must be 
a symmetric positive-definite function. We will also make the following 
assumptions. 

Assumption 1. $K$ is continuous and integrable.

$K$ thus has Fourier transform

$$\hat K(\xi) := \int_{\mathbb{R}^d} e^{-2\pi i \langle x, \xi \rangle} K(x)\, dx,$$

and by Bochner's theorem, $\hat K$ is non-negative and integrable.

Assumption 2. $\hat K$ is isotropic and radially non-increasing.

In other words, $\hat K(x) = \hat k(\|x\|)$ for a non-increasing function
$\hat k : [0, \infty) \to [0, \infty)$; as a consequence, $K$ is isotropic.

Assumption 3. As $x \to \infty$, either:

(i) $\hat K(x) = \Theta(\|x\|^{-2\nu - d})$ for some $\nu > 0$; or

(ii) $\hat K(x) = O(\|x\|^{-2\nu - d})$ for all $\nu > 0$ (we will then say
that $\nu = \infty$).

Note the condition $\nu > 0$ is required for $\hat K$ to be integrable.

Assumption 4. $K$ is $C^k$, for $k$ the largest integer less than $2\nu$,
and at the origin, $K$ has $k$-th order Taylor approximation $P_k$
satisfying

$$|K(x) - P_k(x)| = O\left(\|x\|^{2\nu} (-\log \|x\|)^{2\alpha}\right)$$

as $x \to 0$, for some $\alpha \ge 0$.

When $\alpha = 0$, this is just the condition that $K$ be $2\nu$-Hölder at
the origin; when $\alpha > 0$, we instead require this condition up to a
log factor.

The rate $\nu$ controls the smoothness of functions from the prior: almost
surely, $f$ has continuous derivatives of any order $k < \nu$ (Adler and
Taylor, 2007, §1.4.2). Popular kernels include the Matérn class,

$$K^\nu(x) := \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\,\|x\|\right)^\nu K_\nu\left(\sqrt{2\nu}\,\|x\|\right), \qquad \nu \in (0, \infty),$$

where $K_\nu$ is a modified Bessel function of the second kind, and the
Gaussian kernel,

$$K^\infty(x) := e^{-\frac{1}{2}\|x\|^2},$$

obtained in the limit $\nu \to \infty$ (Rasmussen and Williams, 2006,
§4.2). Between them, these kernels cover the full range of smoothness
$0 < \nu \le \infty$. Both kernels satisfy Assumptions 1-4 for the $\nu$
given; $\alpha = 0$ except for the Matérn kernel with
$\nu \in \mathbb{N}$, where $\alpha = \frac{1}{2}$ (Abramowitz and Stegun,
1965, §9.6).

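Having chosen our prior distribution, we may now derive its posterior. We
find

$$f(x) \mid z_1, \dots, z_n \sim N\left(\hat f_n(x; \theta),\ \sigma^2 s_n^2(x; \theta)\right),$$

where

$$\hat\mu_n(\theta) := \frac{1^T V^{-1} z}{1^T V^{-1} 1},$$  (4)

$$\hat f_n(x; \theta) := \hat\mu_n + v^T V^{-1} (z - \hat\mu_n 1),$$  (5)

and

$$s_n^2(x; \theta) := 1 - v^T V^{-1} v + \frac{(1 - 1^T V^{-1} v)^2}{1^T V^{-1} 1},$$  (6)

for $z = (z_i)_{i=1}^n$, $V = (K_\theta(x_i - x_j))_{i,j=1}^n$, and
$v = (K_\theta(x - x_i))_{i=1}^n$ (Santner et al., 2003, §4.1.3).
Equivalently, these expressions are the best linear unbiased predictor of
$f(x)$ and its variance, as given in Jones et al. (1998, §2). We will also
need the reduced sum of squares,

$$\hat R_n^2(\theta) := (z - \hat\mu_n 1)^T V^{-1} (z - \hat\mu_n 1).$$  (7)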
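
The quantities (4)-(7) can be evaluated with a few linear solves. The
following is a minimal numerical sketch, assuming NumPy and the Gaussian
kernel $K^\infty$; the names `gaussian_kernel` and `posterior` are
illustrative choices, not part of the paper, and no attempt is made to
handle ill-conditioned $V$.

```python
import numpy as np

def gaussian_kernel(t, theta):
    """K_theta(t) = K(t / theta), with K(x) = exp(-||x||^2 / 2) as in (3)."""
    u = np.asarray(t, dtype=float) / theta
    return np.exp(-0.5 * np.sum(u * u, axis=-1))

def posterior(x_new, X, z, theta):
    """Return (mu_hat, f_hat, s2, R2) from equations (4)-(7).

    X has shape (n, d), z has shape (n,), x_new and theta have shape (d,).
    """
    one = np.ones(len(z))
    V = gaussian_kernel(X[:, None, :] - X[None, :, :], theta)  # n x n matrix
    v = gaussian_kernel(x_new[None, :] - X, theta)             # length-n vector
    Vinv_1 = np.linalg.solve(V, one)
    Vinv_v = np.linalg.solve(V, v)
    mu_hat = (Vinv_1 @ z) / (Vinv_1 @ one)                     # (4), V is symmetric
    resid = z - mu_hat * one
    Vinv_resid = np.linalg.solve(V, resid)
    f_hat = mu_hat + v @ Vinv_resid                            # (5)
    s2 = 1.0 - v @ Vinv_v + (1.0 - one @ Vinv_v) ** 2 / (one @ Vinv_1)  # (6)
    R2 = resid @ Vinv_resid                                    # (7)
    return mu_hat, f_hat, max(s2, 0.0), R2  # clip rounding error in s2

# Illustrative data (assumed): three observations in d = 1.
X = np.array([[0.0], [0.5], [1.0]])
z = np.array([1.0, 0.2, 0.7])
theta = np.array([0.3])
print(posterior(np.array([0.25]), X, z, theta))
```

At a design point $x = x_i$ the vector $v$ is the $i$-th column of $V$, so
(5) returns $z_i$ and (6) returns zero: the posterior mean interpolates the
data.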

2.3 Expected Improvement Strategies

Under our assumptions on $\pi$, we may now derive an analytic form for (1),
as in Jones et al. (1998, §4.1). We obtain

$$EI_n(x_{n+1}; \pi) = \rho\left(z_n^* - \hat f_n(x_{n+1}; \theta),\ \sigma s_n(x_{n+1}; \theta)\right),$$  (8)

where

$$\rho(y, s) := \begin{cases} y\,\Phi(y/s) + s\,\varphi(y/s), & s > 0, \\ \max(y, 0), & s = 0, \end{cases}$$  (9)

and $\Phi$ and $\varphi$ are the standard normal distribution and density
functions respectively.
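
As a small sketch of (9), using only the Python standard library: the
function names are illustrative, and the normal distribution function is
written via `math.erfc` for accuracy in the lower tail.

```python
import math

def std_normal_cdf(x):
    """Standard normal distribution function Phi."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def std_normal_pdf(x):
    """Standard normal density phi."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def rho(y, s):
    """Equation (9): expected improvement for mean gap y and posterior scale s."""
    if s < 0:
        raise ValueError("s must be non-negative")
    if s == 0:
        return max(y, 0.0)
    u = y / s
    return y * std_normal_cdf(u) + s * std_normal_pdf(u)

# With y = z_n^* - f_hat_n(x) and s = sigma * s_n(x), rho(y, s) is EI_n(x) in (8).
print(rho(0.0, 1.0), rho(-3.0, 1.0), rho(1.0, 0.0))
```

Note that $\rho(y, s) > 0$ whenever $s > 0$, even if the posterior mean at
$x$ is worse than the best observation; this is what drives exploration.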

For a prior $\pi$ as above, expected improvement chooses $x_{n+1}$ to
maximize (8), but this does not fully define the strategy. Firstly, we must
describe how the strategy breaks ties, when more than one $x \in X$
maximizes $EI_n$. In general, this will not affect the behaviour of the
algorithm, so we allow any choice of $x_{n+1}$ maximizing (8).

Secondly, we must say how to choose $x_1$, as the above expressions are
undefined when $n = 0$. In fact, Jones et al. (1998, §4.2) find that
expected improvement can be unreliable given few data points, and recommend
that several initial design points be chosen in a random quasi-uniform
arrangement. We will therefore assume that until some fixed time $k$,
points $x_1, \dots, x_k$ are instead chosen by some (potentially random)
method independent of $f$. We thus obtain the following strategy.

Definition 1. An $EI(\pi)$ strategy chooses:

(i) initial design points $x_1, \dots, x_k$ independently of $f$; and

(ii) further design points $x_{n+1}$ ($n \ge k$) from the maximizers of (8).
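
To make Definition 1 concrete, the following is a minimal one-dimensional
sketch, assuming NumPy, the Gaussian kernel, and a finite candidate grid in
place of a continuous maximization over $X$. The grid search, the small
diagonal jitter added to $V$, the rule skipping already-observed points,
and all function names are illustrative assumptions rather than part of
the strategy as defined.

```python
import math
import numpy as np

def kernel(t, theta):
    # Gaussian correlation kernel K_theta(t) = exp(-(t / theta)^2 / 2), d = 1.
    return np.exp(-0.5 * (t / theta) ** 2)

def normal_cdf(u):
    return 0.5 * math.erfc(-u / math.sqrt(2.0))

def normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def expected_improvement(x, xs, zs, theta, sigma):
    """Evaluate (8) at scalar x, given design points xs and observations zs."""
    n = len(xs)
    one = np.ones(n)
    # Jitter is a numerical safeguard (an assumption), not part of (4)-(6).
    V = kernel(xs[:, None] - xs[None, :], theta) + 1e-10 * np.eye(n)
    v = kernel(x - xs, theta)
    Vinv_1 = np.linalg.solve(V, one)
    Vinv_v = np.linalg.solve(V, v)
    mu = (Vinv_1 @ zs) / (Vinv_1 @ one)                                  # (4)
    f_hat = mu + Vinv_v @ (zs - mu)                                      # (5)
    s2 = 1.0 - v @ Vinv_v + (1.0 - one @ Vinv_v) ** 2 / (one @ Vinv_1)   # (6)
    s = sigma * math.sqrt(max(s2, 0.0))
    y = zs.min() - f_hat
    if s == 0.0:
        return max(y, 0.0)
    return y * normal_cdf(y / s) + s * normal_pdf(y / s)                 # (9)

def run_ei(f, grid, initial, n_steps, theta=0.2, sigma=1.0):
    """Initial points fixed in advance, then repeated maximization of (8)."""
    xs = np.array(initial, dtype=float)
    zs = np.array([f(x) for x in xs])
    for _ in range(n_steps):
        # Skip observed points, where EI vanishes, to keep V well-defined.
        candidates = [x for x in grid if np.min(np.abs(xs - x)) > 1e-9]
        scores = [expected_improvement(x, xs, zs, theta, sigma) for x in candidates]
        x_next = candidates[int(np.argmax(scores))]
        xs = np.append(xs, x_next)
        zs = np.append(zs, f(x_next))
    i_star = int(np.argmin(zs))  # x_n^* is the best design point so far
    return xs[i_star], zs[i_star], xs

# Illustrative run on an assumed test function.
x_best, z_best, xs = run_ei(lambda x: (x - 0.3) ** 2,
                            np.linspace(0.0, 1.0, 101), [0.0, 0.5, 1.0], 10)
print(x_best, z_best)
```

Here $\theta$ and $\sigma$ are held fixed throughout, matching the fixed
prior of Definition 1.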

So far, we have not considered the choice of parameters $\sigma$ and
$\theta$. While these can be fixed in advance, doing so requires us to
specify characteristic scales of the unknown function $f$, and causes
expected improvement to behave differently on a rescaling of the same
function. We would prefer an algorithm which could adapt automatically to
the scale of $f$.

A natural approach is to take maximum likelihood estimates of the
parameters, as recommended by Jones et al. (1998, §2). Given $\theta$, the
MLE $\hat\sigma_n^2 = \hat R_n^2(\theta)/n$; for full generality, we will
allow any choice $\hat\sigma_n^2 = c_n \hat R_n^2(\theta)$, where
$c_n = o(1/\log n)$. Estimates of $\theta$, however, must be obtained by
numerical optimization. As $\theta$ can vary widely in scale, this
optimization is best performed over $\log\theta$; as the likelihood surface
is typically multimodal, this requires the use of a global optimizer. We
must therefore place (implicit or explicit) bounds on the allowed values of
$\log\theta$. We have thus described the following strategy.

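Definition 2. Let $\hat\pi_n$ be a sequence of priors, with parameters
$\hat\sigma_n$, $\hat\theta_n$ satisfying:

(i) $\hat\sigma_n^2 = c_n \hat R_n^2(\hat\theta_n)$ for constants
$c_n > 0$, $c_n = o(1/\log n)$; and

(ii) $\theta^L \le \hat\theta_n \le \theta^U$ for constants
$\theta^L, \theta^U \in \mathbb{R}^d_+$.

An $EI(\hat\pi)$ strategy satisfies Definition 1, replacing $\pi$ with
$\hat\pi_n$ in (8).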
3 Convergence Rates

To discuss convergence, we must first choose a smoothness class for the
unknown function $f$. Each kernel $K_\theta$ is associated with a space of
functions $\mathcal{H}_\theta(X)$, its reproducing-kernel Hilbert space
(RKHS) or native space. $\mathcal{H}_\theta(X)$ contains all functions
$X \to \mathbb{R}$ as smooth as a posterior mean of $f$, and is the natural
space to study convergence of expected-improvement algorithms, allowing a
tractable analysis of their asymptotic behaviour.

3.1 Reproducing-Kernel Hilbert Spaces 

Given a symmetric positive-definite kernel $K$ on $\mathbb{R}^d$, set
$k_x(t) = K(t - x)$. For $S \subseteq \mathbb{R}^d$, let $\mathcal{E}(S)$
be the space of functions $S \to \mathbb{R}$ spanned by the $k_x$, for
$x \in S$. Furnish $\mathcal{E}(S)$ with the inner product defined by

$$\langle k_x, k_y \rangle := K(x - y).$$

The completion of $\mathcal{E}(S)$ under this inner product is the
reproducing-kernel Hilbert space $\mathcal{H}(S)$ of $K$ on $S$. The
members $f \in \mathcal{H}(S)$ are abstract objects, but we can identify
them with functions $f : S \to \mathbb{R}$ through the reproducing
property,

$$f(x) = \langle f, k_x \rangle,$$

which holds for all $f \in \mathcal{E}(S)$. See Aronszajn (1950), Berlinet
and Thomas-Agnan (2004), Wendland (2005) and van der Vaart and van Zanten
(2008).
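
For elements of $\mathcal{E}(S)$ the inner product reduces to finite sums,
which gives a quick numerical check of the reproducing property. This is a
sketch assuming NumPy and the Gaussian kernel; `span_inner` and
`span_eval` are illustrative names.

```python
import numpy as np

def K(t):
    """Gaussian kernel K(x) = exp(-||x||^2 / 2), applied along the last axis."""
    return np.exp(-0.5 * np.sum(np.square(t), axis=-1))

def span_inner(a, Xa, b, Xb):
    """<sum_i a_i k_{x_i}, sum_j b_j k_{y_j}> = sum_ij a_i b_j K(x_i - y_j)."""
    G = K(Xa[:, None, :] - Xb[None, :, :])
    return a @ G @ b

def span_eval(a, Xa, x):
    """Pointwise value f(x) = sum_i a_i K(x - x_i)."""
    return a @ K(x[None, :] - Xa)

# f = 2 k_{x_1} - k_{x_2} in d = 2 (assumed example data).
Xa = np.array([[0.0, 0.0], [1.0, 0.5]])
a = np.array([2.0, -1.0])
x = np.array([0.3, 0.7])

# Reproducing property: <f, k_x> agrees with f(x).
print(span_inner(a, Xa, np.array([1.0]), x[None, :]), span_eval(a, Xa, x))
# Squared RKHS norm of f.
print(span_inner(a, Xa, a, Xa))
```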

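We will find it convenient also to use an alternative characterization of
$\mathcal{H}(S)$. We begin by describing $\mathcal{H}(\mathbb{R}^d)$ in
terms of Fourier transforms. Let $\hat f$ denote the Fourier transform of a
function $f \in L^2$. The following result is stated in Parzen (1963, §2),
and proved in Wendland (2005, §10.2); we give a short proof in Appendix A.

Lemma 1. $\mathcal{H}(\mathbb{R}^d)$ is the space of real continuous
$f \in L^2(\mathbb{R}^d)$ whose norm

$$\|f\|^2_{\mathcal{H}(\mathbb{R}^d)} := \int \frac{|\hat f(\xi)|^2}{\hat K(\xi)}\, d\xi$$

is finite, taking $0/0 = 0$.

We may now describe $\mathcal{H}(S)$ in terms of
$\mathcal{H}(\mathbb{R}^d)$.

Lemma 2 (Aronszajn, 1950, §1.5). $\mathcal{H}(S)$ is the space of functions
$f = g|_S$ for some $g \in \mathcal{H}(\mathbb{R}^d)$, with norm

$$\|f\|_{\mathcal{H}(S)} := \inf_{g|_S = f} \|g\|_{\mathcal{H}(\mathbb{R}^d)},$$

and there is a unique $g$ minimizing this expression.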
These spaces are in fact closely related to the Sobolev Hilbert spaces of
functional analysis. Say a domain $D \subseteq \mathbb{R}^d$ is Lipschitz
if its boundary is locally the graph of a Lipschitz function (see Tartar,
2007, §12, for a precise definition). For such a domain $D$, the Sobolev
Hilbert space $H^s(D)$ is the space of functions $f : D \to \mathbb{R}$,
given by the restriction of some $g : \mathbb{R}^d \to \mathbb{R}$, whose
norm

$$\|f\|^2_{H^s(D)} := \inf_{g|_D = f} \int |\hat g(\xi)|^2 (1 + \|\xi\|^2)^{s}\, d\xi$$

is finite. Thus, for the kernel $K$ with Fourier transform
$\hat K(\xi) = (1 + \|\xi\|^2)^{-s}$, this is just the RKHS
$\mathcal{H}(D)$. More generally, if $K$ satisfies our assumptions with
$\nu < \infty$, these spaces are equivalent in the sense of normed spaces:
they contain the same functions, and have norms $\|\cdot\|_1$,
$\|\cdot\|_2$ satisfying

$$C\|f\|_1 \le \|f\|_2 \le C'\|f\|_1$$

for constants $0 < C \le C'$.

Lemma 3. Let $\mathcal{H}_\theta(S)$ denote the RKHS of $K_\theta$ on $S$,
and $D \subseteq \mathbb{R}^d$ be a Lipschitz domain.

(i) If $\nu < \infty$, $\mathcal{H}_\theta(D)$ is equivalent to the Sobolev
Hilbert space $H^{\nu + d/2}(D)$.

(ii) If $\nu = \infty$, $\mathcal{H}_\theta(D)$ is continuously embedded in
$H^s(D)$ for all $s$.

Thus if $\nu < \infty$, and $X$ is, say, a product of intervals
$\prod_{i=1}^d [a_i, b_i]$, the RKHS $\mathcal{H}_\theta(X)$ is equivalent
to the Sobolev Hilbert space $H^{\nu + d/2}(\prod_{i=1}^d (a_i, b_i))$,
identifying each function in that space with its unique continuous
extension to $X$.

3.2 Fixed Parameters 

We are now ready to state our main results. Let $X \subset \mathbb{R}^d$ be
compact with non-empty interior. For a function $f : X \to \mathbb{R}$, let
$P_f^u$ and $E_f^u$ denote probability and expectation when minimizing the
fixed function $f$ with strategy $u$. (Note that while $f$ is fixed, $u$
may be random, so its performance is still probabilistic in nature.) We
define the loss suffered over the ball $B_R$ in $\mathcal{H}_\theta(X)$
after $n$ steps by a strategy $u$,

$$L_n(u, \mathcal{H}_\theta(X), R) := \sup_{\|f\|_{\mathcal{H}_\theta(X)} \le R} E_f^u[f(x_n^*) - \min f].$$

We will say that $u$ converges on the optimum at rate $r_n$, if

$$L_n(u, \mathcal{H}_\theta(X), R) = O(r_n)$$

for all $R > 0$. Note that we do not allow $u$ to vary with $R$; the
strategy must achieve this rate without prior knowledge of
$\|f\|_{\mathcal{H}_\theta(X)}$.

We begin by showing that the minimax rate of convergence is $n^{-\nu/d}$.

Theorem 1. If $\nu < \infty$, then for any $\theta \in \mathbb{R}^d_+$,
$R > 0$,

$$\inf_u L_n(u, \mathcal{H}_\theta(X), R) = \Theta(n^{-\nu/d}),$$

and this rate can be achieved by a strategy $u$ not depending on $R$.

The upper bound is provided by a naive strategy as in the introduction: 
we fix a quasi-uniform sequence $x_n$ in advance, and take $x_n^*$ to minimize
a radial basis function interpolant of the data. As remarked previously, 
however, this naive strategy is not very satisfying; in practice it will be 
outperformed by any good strategy varying with the data. We may thus ask 
whether more sophisticated strategies, with better practical performance, 
can still provide good worst-case bounds. 

One such strategy is the $EI(\pi)$ strategy of Definition 1. We can show
this strategy converges at least at rate $n^{-(\nu \wedge 1)/d}$, up to log
factors.

Theorem 2. Let $\pi$ be a prior with length-scales
$\theta \in \mathbb{R}^d_+$. For any $R > 0$,

$$L_n(EI(\pi), \mathcal{H}_\theta(X), R) = \begin{cases} O(n^{-\nu/d} (\log n)^\alpha), & \nu \le 1, \\ O(n^{-1/d}), & \nu > 1. \end{cases}$$

For $\nu \le 1$, these rates are near-optimal. For $\nu > 1$, we are faced
with a more difficult problem; we discuss this in more detail in Section
3.4.

3.3 Estimated Parameters 

First, we consider the effect of the prior parameters on $EI(\pi)$. While
the previous result gives a convergence rate for any fixed choice of
parameters, the constant in that rate will depend on the parameters chosen;
to choose well, we must somehow estimate these parameters from the data.
The $EI(\hat\pi)$ strategy, given by Definition 2, uses maximum likelihood
estimates for this purpose. We can show, however, that this may cause the
strategy to never converge.

Theorem 3. Suppose $\nu < \infty$. Given $\theta \in \mathbb{R}^d_+$,
$R > 0$, $\varepsilon > 0$, there exists $f \in \mathcal{H}_\theta(X)$
satisfying $\|f\|_{\mathcal{H}_\theta(X)} \le R$, and for some fixed
$\delta > 0$,

$$P_f^{EI(\hat\pi)}\left(\inf_n f(x_n^*) - \min f \ge \delta\right) > 1 - \varepsilon.$$

The counterexamples constructed in the proof of the theorem may be 
difficult to minimize, but they are not badly-behaved (Figure 1). A good 
optimization strategy should be able to minimize such functions, and we 
must ask why expected improvement fails. 

We can understand the issue by considering the constant in Theorem 2. 
Define 

$$\tau(x) := x\,\Phi(x) + \varphi(x).$$

Figure 1: A counterexample from Theorem 3



From the proof of Theorem 2, the dominant term in the convergence rate has
constant

$$C(R + \sigma)\, \frac{\tau(R/\sigma)}{\tau(-R/\sigma)},$$  (10)

for $C > 0$ not depending on $R$ or $\sigma$. In Appendix A, we will prove
the following result.
following result. 



Corollary 1. $\hat R_n(\theta)$ is non-decreasing in $n$, and bounded above
by $\|f\|_{\mathcal{H}_\theta(X)}$.



Hence for fixed $\theta$, the estimate
$\hat\sigma_n^2 = \hat R_n^2(\theta)/n \le R^2/n$, and thus
$R/\hat\sigma_n \ge n^{1/2}$. Inserting this choice into (10) gives a
constant growing exponentially in $n$, destroying our convergence rate.
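
The growth of this constant is easy to see numerically. The sketch below
evaluates (10) with the unspecified constant $C$ set to 1 and $R = 1$ (both
assumptions made purely for illustration), comparing the boundary case
$\sigma = R/\sqrt{n}$ of the maximum likelihood estimate against the fixed
choice $\sigma = R$.

```python
import math

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def tau(x):
    """tau(x) = x Phi(x) + phi(x)."""
    return x * Phi(x) + phi(x)

def rate_constant(R, sigma, C=1.0):
    """Constant (10); C = 1 is an illustrative assumption."""
    return C * (R + sigma) * tau(R / sigma) / tau(-R / sigma)

R = 1.0
for n in (4, 16, 64, 256):
    sigma_mle = R / math.sqrt(n)  # largest sigma allowed by sigma_n^2 <= R^2 / n
    print(n, rate_constant(R, sigma_mle), rate_constant(R, R))
```

Since $\tau(x) - \tau(-x) = x$, the ratio $\tau(x)/\tau(-x)$ is governed by
the tail behaviour of $\tau(-x)$, which decays like a Gaussian density as
$x = R/\sigma$ grows.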

To resolve the issue, we will instead try to pick $\sigma$ to minimize
(10). The term $R + \sigma$ is increasing in $\sigma$, and the term
$\tau(R/\sigma)/\tau(-R/\sigma)$ is decreasing in $\sigma$; we may balance
the terms by taking $\sigma = R$. The constant is then proportional to $R$,
which we may minimize by taking $R = \|f\|_{\mathcal{H}_\theta(X)}$. In
practice, we will not know $\|f\|_{\mathcal{H}_\theta(X)}$ in advance, so
we must estimate it from the data; from Corollary 1, a convenient estimate
is $\hat R_n(\theta)$.

Suppose, then, that we make some bounded estimate $\hat\theta_n$ of
$\theta$, and set $\hat\sigma_n^2 = \hat R_n^2(\hat\theta_n)$. As Theorem 3
holds for any $\hat\sigma_n^2$ of faster than logarithmic decay, such a
choice is necessary to ensure convergence. (We may also choose $\theta$ to
minimize (10); we might then pick $\hat\theta_n$ minimizing
$\hat R_n(\theta) \prod_{i=1}^d \theta_i^{-\nu/d}$, but our assumptions on
$\hat\theta_n$ are weak enough that we need not consider this further.)

If we believe our Gaussian-process model, this estimate $\hat\sigma_n$ is
certainly unusual. We should, however, take care before placing too much
faith in the model. The function in Figure 1 is a reasonable function to
optimize, but as a Gaussian process it is highly atypical: there are
intervals on which the function is constant, an event which in our model
occurs with probability zero. If we want our algorithm to succeed on more
general classes of functions, we will need to choose our parameter
estimates appropriately.

To obtain good rates, we must add a further condition to our strategy. If
$z_1 = \dots = z_n$, $EI_n(\,\cdot\,; \hat\pi_n)$ is identically zero, and
all choices of $x_{n+1}$ are equally valid. To ensure we fully explore $f$,
we will therefore require that when our strategy is applied to a constant
function $f(x) = c$, it produces a sequence $x_n$ dense in $X$. (This can
be achieved, for example, by choosing $x_{n+1}$ uniformly at random from
$X$ when $z_1 = \dots = z_n$.) We have thus described the following
strategy.

Definition 3. An $EI(\tilde\pi)$ strategy satisfies Definition 2, except:

(i) we instead set $\hat\sigma_n^2 = \hat R_n^2(\hat\theta_n)$; and

(ii) we require the choice of $x_{n+1}$ maximizing (8) to be such that, if
$f$ is constant, the design points are almost surely dense in $X$.

We cannot now prove a convergence result uniform over balls in
$\mathcal{H}_\theta(X)$, as the rate of convergence depends on the ratio
$R/\hat R_n$, which is unbounded. (Indeed, any estimator of
$\|f\|_{\mathcal{H}_\theta(X)}$ must sometimes perform poorly: $f$ can
appear from the data to have arbitrarily small norm, while in fact having a
spike somewhere we have not yet observed.) We can, however, provide the
same convergence rates as in Theorem 2, in a slightly weaker sense.

Theorem 4. For any $f \in \mathcal{H}_{\theta^U}(X)$, under
$P_f^{EI(\tilde\pi)}$,

$$f(x_n^*) - \min f = \begin{cases} O_p(n^{-\nu/d} (\log n)^\alpha), & \nu \le 1, \\ O_p(n^{-1/d}), & \nu > 1. \end{cases}$$



3.4 Near-Optimal Rates 



So far, our rates have been near-optimal only for $\nu \le 1$. To obtain
good rates for $\nu > 1$, standard results on the performance of
Gaussian-process interpolation (Narcowich et al., 2003, §6) require the
design points $x_i$ to be quasi-uniform in a region of interest. It is
unclear whether this occurs naturally under expected improvement, but there
are many ways we can modify the algorithm to ensure it.

Perhaps the simplest, and most well-known, is an $\varepsilon$-greedy
strategy (Sutton and Barto, 1998, §2.2). In such a strategy, at each step
with probability $1 - \varepsilon$ we make a decision to maximize some
greedy criterion; with probability $\varepsilon$ we make a decision
completely at random. This random choice ensures that the short-term nature
of the greedy criterion does not overshadow our long-term goal.

The parameter $\varepsilon$ controls the trade-off between global and local
search: a good choice of $\varepsilon$ will be small enough to not
interfere with the expected-improvement algorithm, but large enough to
prevent it from getting stuck in a local minimum. Sutton and Barto (1998,
§2.2) consider the values $\varepsilon = 0.1$ and $\varepsilon = 0.01$, but
in practical work $\varepsilon$ should of course be calibrated to a typical
problem set.

We therefore define the following strategies. 

Definition 4. Let $\cdot$ denote $\pi$, $\hat\pi$ or $\tilde\pi$. For
$0 < \varepsilon < 1$, an $EI(\cdot, \varepsilon)$ strategy:

(i) chooses initial design points $x_1, \dots, x_k$ independently of $f$;

(ii) with probability $1 - \varepsilon$, chooses design point $x_{n+1}$
($n \ge k$) as in $EI(\cdot)$; or

(iii) with probability $\varepsilon$, chooses $x_{n+1}$ ($n \ge k$)
uniformly at random from $X$.
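
Steps (ii) and (iii) amount to a single randomized choice per step. A
minimal sketch, using only the Python standard library: the callbacks
`ei_maximizer` and `uniform_sampler` are hypothetical placeholders for a
maximizer of (8) and a uniform sampler on $X$, neither of which is
specified here.

```python
import random

def epsilon_greedy_point(eps, ei_maximizer, uniform_sampler, rng=random):
    """Choose x_{n+1} for n >= k as in steps (ii) and (iii) of Definition 4.

    ei_maximizer():       returns a maximizer of (8)        (hypothetical callback)
    uniform_sampler(rng): returns a uniform draw from X      (hypothetical callback)
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("Definition 4 takes 0 < eps < 1")
    if rng.random() < eps:
        return uniform_sampler(rng)  # step (iii): uniform exploration
    return ei_maximizer()            # step (ii): expected improvement

# Illustration with X = [0, 1] and a stand-in maximizer returning 0.25.
rng = random.Random(0)
points = [epsilon_greedy_point(0.1, lambda: 0.25, lambda r: r.uniform(0.0, 1.0), rng)
          for _ in range(5)]
print(points)
```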

We can show that these strategies achieve near-optimal rates of convergence
for all $\nu < \infty$.

Theorem 5. Let $EI(\cdot, \varepsilon)$ be one of the strategies in
Definition 4. If $\nu < \infty$, then for any $R > 0$,

$$L_n(EI(\cdot, \varepsilon), \mathcal{H}_{\theta^U}(X), R) = O\left((n/\log n)^{-\nu/d} (\log n)^\alpha\right),$$

while if $\nu = \infty$, the statement holds for all $\nu < \infty$.

Note that unlike a typical $\varepsilon$-greedy algorithm, we do not rely
on random choice to obtain global convergence: as above, the $EI(\pi)$ and
$EI(\tilde\pi)$ strategies are already globally convergent. Instead, we use
random choice simply to improve upon the worst-case rate. Note also that
the result does not in general hold when $\varepsilon = 1$; to obtain good
rates, we must combine global search with inference about $f$.

4 Conclusions 

We have shown that expected improvement can converge near-optimally, but 
a naive implementation may not converge at all. We thus echo Diaconis and 
Freedman (1986) in stating that, for infinite-dimensional problems, Bayesian 
methods are not always guaranteed to find the right answer; such guarantees 
can only be provided by considering the problem at hand. 

We might ask, however, if our framework can also be improved. Our 
upper bounds on convergence were established using naive algorithms, which 
in practice would prove inefficient. If a sophisticated algorithm fails where a 
naive one succeeds, then the sophisticated algorithm is certainly at fault; we 
might, however, prefer methods of evaluation which do not consider naive 
algorithms so successful. 
Vazquez and Bect (2010) and Grünewälder et al. (2010) consider a more
Bayesian formulation of the problem, where the unknown function $f$ is
distributed according to the prior $\pi$, but this approach can prove
restrictive: as we saw in Section 3.3, placing too much faith in the prior
may exclude functions of interest. Further, Grünewälder et al. find the
same issues are present also within the Bayesian framework.

A more interesting approach is given by the continuum-armed-bandit problem
(Srinivas et al., 2010, and references therein). Here the goal is to
minimize the cumulative regret,

$$R_n := \sum_{i=1}^n (f(x_i) - \min f),$$

in general observing the function $f$ under noise. Algorithms controlling
the cumulative regret at rate $r_n$ also solve the optimization problem, at
rate $r_n/n$ (Bubeck et al., 2009, §3). The naive algorithms above,
however, have poor cumulative regret. We might, then, consider the
cumulative regret to be a better measure of performance, but this approach
too has limitations. Firstly, the cumulative regret is necessarily
increasing, so cannot establish rates of optimization faster than $n^{-1}$.
(This is not an issue under noise, where typically
$r_n = \Omega(n^{1/2})$, see Kleinberg and Slivkins, 2010.) Secondly, if
our goal is optimization, then minimizing the regret, a cost we do not
incur, may obscure the problem at hand.
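
The link between the two notions of regret in the noiseless setting is
elementary: the best observation is no worse than the average one, so the
simple regret is at most $R_n/n$. A small sketch with made-up observations
(the values and the known minimum are assumptions for illustration):

```python
def cumulative_regret(values, f_min):
    """R_n = sum_i (f(x_i) - min f) over the observed values f(x_i)."""
    return sum(v - f_min for v in values)

def simple_regret(values, f_min):
    """Loss z_n^* - min f when x_n^* is the best design point so far."""
    return min(values) - f_min

# Hypothetical noiseless observations of a function with min f = 0.
values = [0.9, 0.4, 0.5, 0.1]
n = len(values)
print(simple_regret(values, 0.0), cumulative_regret(values, 0.0) / n)
```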

Bubeck et al. (2010) study this problem with the additional assumption 
that $f$ has finitely many minima, and is, say, quadratic in a neighbourhood
of each. This assumption may suffice in practice, and allows the authors to 
obtain impressive rates of convergence. For optimization, however, a further 
weakness is that these rates hold only once the algorithm has found a basin 
of attraction; they thus measure local, rather than global, performance. 
It may be that convergence rates alone are not sufficient to capture the 
performance of a global optimization algorithm, and the time taken to find a 
basin of attraction is more relevant. In any case, the choice of an appropriate 
framework to measure performance in global optimization merits further 
study. 

Finally, we should also ask how to choose the smoothness parameter $\nu$
(or the equivalent parameter in similar algorithms). Van der Vaart and van
Zanten (2009) show that Bayesian Gaussian-process models can, in some
contexts, automatically adapt to the smoothness of an unknown function
$f$. Their technique requires, however, that the estimated length-scales $\hat\theta_n$
tend to 0, posing both practical and theoretical challenges. The question of
how best to optimize functions of unknown smoothness remains open.

A Proofs

We now prove the results in Section 3. 

A.1 Reproducing-Kernel Hilbert Spaces

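Proof of Lemma 1. Let $V$ be the space of functions described, and $W$ be
the closed real subspace of Hermitian functions in $L^2(\mathbb{R}^d, \hat K^{-1})$. We will
show $f \mapsto \hat f$ is an isomorphism $V \to W$, so we may equivalently work with
$W$. Given $\hat f \in W$, by Cauchy-Schwarz and Bochner's theorem,

$$\int |\hat f| \le \Bigl(\int \hat K\Bigr)^{1/2} \Bigl(\int |\hat f|^2 / \hat K\Bigr)^{1/2} < \infty,$$

and as $\|\hat K\|_\infty \le \|K\|_1$,

$$\int |\hat f|^2 \le \|\hat K\|_\infty \int |\hat f|^2 / \hat K < \infty,$$

so $\hat f \in L^1 \cap L^2$. $\hat f$ is thus the Fourier transform of a real continuous $f \in L^2$,
satisfying the Fourier inversion formula everywhere.

$f \mapsto \hat f$ is hence an isomorphism $V \to W$. It remains to show that
$V = \mathcal{H}(\mathbb{R}^d)$. $W$ is complete, so $V$ is. Further, $\mathcal{E}(\mathbb{R}^d) \subset V$, and by Fourier
inversion each $f \in V$ satisfies the reproducing property,

$$f(x) = \int e^{2\pi i \langle x, \xi \rangle} \hat f(\xi)\, d\xi = \int \frac{\hat f(\xi)\, \overline{\hat k_x(\xi)}}{\hat K(\xi)}\, d\xi = \langle f, k_x \rangle,$$

so $\mathcal{H}(\mathbb{R}^d)$ is a closed subspace of $V$. Given $f \in \mathcal{H}(\mathbb{R}^d)^\perp$, $f(x) = \langle f, k_x \rangle = 0$
for all $x$, so $f = 0$. Thus $V = \mathcal{H}(\mathbb{R}^d)$. □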

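Proof of Lemma 3. By Lemma 1, the norm on $\mathcal{H}_\theta(\mathbb{R}^d)$ is

$$\|f\|^2_{\mathcal{H}_\theta(\mathbb{R}^d)} = \int \frac{|\hat f(\xi)|^2}{\hat K_\theta(\xi)}\, d\xi,$$

and $K_\theta$ has Fourier transform

$$\hat K_\theta(\xi) = \hat K(\theta_1 \xi_1, \dots, \theta_d \xi_d) \prod_{i=1}^d \theta_i.$$

If $\nu < \infty$, by assumption $\hat K(\xi) = \hat k(\|\xi\|)$, for a finite non-increasing function
$\hat k$ satisfying $\hat k(\|\xi\|) = \Theta(\|\xi\|^{-2\nu-d})$ as $\xi \to \infty$. Hence

$$C(1 + \|\xi\|^2)^{-(\nu+d/2)} \le \hat K_\theta(\xi) \le C'(1 + \|\xi\|^2)^{-(\nu+d/2)},$$

for constants $C, C' > 0$, and we obtain that $\mathcal{H}_\theta(\mathbb{R}^d)$ is equivalent to the
Sobolev space $H^{\nu+d/2}(\mathbb{R}^d)$.

From Lemma 2, $\mathcal{H}_\theta(D)$ is given by the restriction of functions in $\mathcal{H}_\theta(\mathbb{R}^d)$;
as $D$ is Lipschitz, the same is true of $H^{\nu+d/2}(D)$. $\mathcal{H}_\theta(D)$ is thus equivalent
to $H^{\nu+d/2}(D)$. Finally, functions in $\mathcal{H}_\theta(\bar D)$ are continuous, so uniquely
identified by their restriction to $D$, and

$$\mathcal{H}_\theta(\bar D) \simeq \mathcal{H}_\theta(D) \simeq H^{\nu+d/2}(D).$$

If $\nu = \infty$, by a similar argument $\mathcal{H}_\theta(D)$ is continuously embedded in all
$H^s(D)$. □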

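From Lemma 1, we can derive results on the behaviour of $\|f\|_{\mathcal{H}_\theta(S)}$ as $\theta$
varies. For small $\theta$, we obtain the following result.

Lemma 4. If $f \in \mathcal{H}_\theta(S)$, then $f \in \mathcal{H}_{\theta'}(S)$ for all $0 < \theta' \le \theta$, and

$$\|f\|^2_{\mathcal{H}_{\theta'}(S)} \le \Bigl(\prod_{i=1}^d \theta_i / \theta'_i\Bigr) \|f\|^2_{\mathcal{H}_\theta(S)}.$$

Proof. Let $C = \prod_{i=1}^d (\theta_i / \theta'_i)$. As $\hat K$ is isotropic and radially non-increasing,
$\hat K_{\theta'} \ge C^{-1} \hat K_\theta$.

Given $f \in \mathcal{H}_\theta(S)$, let $g \in \mathcal{H}_\theta(\mathbb{R}^d)$ be its minimum norm extension, as in
Lemma 2. By Lemma 1,

$$\|f\|^2_{\mathcal{H}_{\theta'}(S)} \le \|g\|^2_{\mathcal{H}_{\theta'}(\mathbb{R}^d)} = \int \frac{|\hat g|^2}{\hat K_{\theta'}} \le C \int \frac{|\hat g|^2}{\hat K_\theta} = C \|f\|^2_{\mathcal{H}_\theta(S)}. \qquad □$$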

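Likewise, for large $\theta$, we obtain the following.

Lemma 5. If $\nu < \infty$, $f \in \mathcal{H}_\theta(S)$, then $f \in \mathcal{H}_{t\theta}(S)$ for $t \ge 1$, and

$$\|f\|^2_{\mathcal{H}_{t\theta}(S)} \le C'' t^{2\nu} \|f\|^2_{\mathcal{H}_\theta(S)},$$

for a $C'' > 0$ depending only on $K$ and $\theta$.

Proof. As in the proof of Lemma 3, we have constants $C, C' > 0$ such that

$$C(1 + \|\xi\|^2)^{-(\nu+d/2)} \le \hat K_\theta(\xi) \le C'(1 + \|\xi\|^2)^{-(\nu+d/2)}.$$

Thus for $t \ge 1$, using $1 + t^2\|\xi\|^2 \le t^2(1 + \|\xi\|^2)$,

$$\hat K_{t\theta}(\xi) = t^d \hat K_\theta(t\xi) \ge C t^d (1 + t^2\|\xi\|^2)^{-(\nu+d/2)} \ge C t^{-2\nu} (1 + \|\xi\|^2)^{-(\nu+d/2)} \ge C C'^{-1} t^{-2\nu} \hat K_\theta(\xi),$$

and we may argue as in the previous lemma. □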

We can also describe the posterior distribution of $f$ in terms of $\mathcal{H}_\theta(S)$;
as a consequence, we may deduce Corollary 1.

Lemma 6. Suppose $f(x) = \mu + g(x)$, $g \in \mathcal{H}_\theta(S)$.

(i) $\hat f_n(x; \theta) = \hat\mu_n + \hat g_n(x)$ solves the optimization problem

minimize $\|g\|^2_{\mathcal{H}_\theta(S)}$, subject to $\mu + g(x_i) = z_i$, $1 \le i \le n$,

with minimum value $\hat R_n^2(\theta)$.

(ii) The prediction error satisfies

$$|f(x) - \hat f_n(x; \theta)| \le s_n(x; \theta) \|g\|_{\mathcal{H}_\theta(S)},$$

with equality for some $g \in \mathcal{H}_\theta(S)$.
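Part (ii) can be checked numerically in a simplified setting. The sketch below is an illustration under stated assumptions, not the estimator defined by equations (4)-(7): it takes the mean as known, $\mu = 0$, works in one dimension, and uses a Gaussian kernel with $K(0) = 1$. In that setting the minimum-norm interpolant is $k(x)^T V^{-1} z$ and $s_n^2(x) = 1 - k(x)^T V^{-1} k(x)$. The names `kernel`, `solve` and `predict`, the design points and the test function are all hypothetical.

```python
import math

def kernel(x, y, theta=0.3):
    """Gaussian kernel K_theta(x - y); an illustrative choice with K(0) = 1."""
    return math.exp(-0.5 * ((x - y) / theta) ** 2)

def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            m = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= m * M[c][j]
    v = [0.0] * n
    for r in reversed(range(n)):
        v[r] = (M[r][n] - sum(M[r][j] * v[j] for j in range(r + 1, n))) / M[r][r]
    return v

def predict(design, z, x):
    """Minimum-norm interpolant at x and s_n(x), assuming known mean mu = 0."""
    V = [[kernel(a, b) for b in design] for a in design]
    k = [kernel(a, x) for a in design]
    lam = solve(V, k)                                  # lambda = V^{-1} k(x)
    mean = sum(l * zi for l, zi in zip(lam, z))
    var = 1.0 - sum(l * ki for l, ki in zip(lam, k))   # ||k_x - sum_i lambda_i k_{x_i}||^2
    return mean, math.sqrt(max(var, 0.0))

# A function in the RKHS with explicit norm: g = sum_j a_j k_{y_j}.
centres, coeffs = [0.1, 0.5, 0.9], [1.0, -2.0, 0.5]
g = lambda x: sum(a * kernel(y, x) for a, y in zip(coeffs, centres))
g_norm = math.sqrt(sum(a * b * kernel(y, w)
                       for a, y in zip(coeffs, centres)
                       for b, w in zip(coeffs, centres)))

design = [0.0, 0.3, 0.6, 1.0]
z = [g(x) for x in design]
checks = []
for x in [0.05, 0.2, 0.45, 0.8, 0.95]:
    mean, s = predict(design, z, x)
    checks.append(abs(g(x) - mean) <= s * g_norm + 1e-9)
```

Every entry of `checks` should be `True`: the interpolation error at each test point is controlled by the posterior standard deviation times the RKHS norm of $g$, which is the Cauchy-Schwarz bound of part (ii).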

Proof.

(i) Let $W = \operatorname{span}(k_{x_1}, \dots, k_{x_n})$, and write $g = g^{\parallel} + g^{\perp}$ for $g^{\parallel} \in W$,
$g^{\perp} \in W^{\perp}$. $g^{\perp}(x_i) = \langle g^{\perp}, k_{x_i} \rangle = 0$, so $g^{\perp}$ affects the optimization only
through $\|g\|$. The minimal $g$ thus has $g^{\perp} = 0$, so $g = \sum_{i=1}^n \lambda_i k_{x_i}$. The
problem then becomes

minimize $\lambda^T V \lambda$, subject to $\mu 1 + V\lambda = z$.

The solution is given by (4) and (5), with value (7).

(ii) By symmetry, the prediction error does not depend on $\mu$, so we may
take $\mu = 0$. Then

$$f(x) - \hat f_n(x; \theta) = g(x) - (\hat\mu_n + \hat g_n(x)) = \langle g, e_{n,x} \rangle,$$

for $e_{n,x} = k_x - \sum_{i=1}^n \lambda_i k_{x_i}$, and



Now, $\|e_{n,x}\|^2_{\mathcal{H}_\theta(S)} = s_n^2(x; \theta)$, as given by (6); this is a consequence of
Loève's isometry, but is easily verified algebraically. The result then
follows by Cauchy-Schwarz. □

A.2 Fixed Parameters



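Proof of Theorem 1. We first establish the lower bound. Suppose we have
$2n$ functions $\psi_m$ with disjoint supports. We will argue that, given $n$
observations, we cannot distinguish between all the $\psi_m$, and thus cannot
accurately pick a minimum $x_n^*$.

To begin with, assume $X = [0, 1]^d$. Let $\psi : \mathbb{R}^d \to [-1, 0]$ be a $C^\infty$ function,
supported inside $X$ and with minimum $-1$. By Lemma 3, $\psi \in \mathcal{H}_\theta(\mathbb{R}^d)$. Fix
$k \in \mathbb{N}$, and set $n = (2k)^d/2$. For vectors $m \in \{0, \dots, 2k-1\}^d$, construct
functions $\psi_m(x) = C(2k)^{-\nu}\psi(2kx - m)$, where $C > 0$ is to be determined.
$\psi_m$ is given by a translation and scaling of $\psi$, so by Lemmas 1, 2 and 5, for
some $C' > 0$,

$$\|\psi_m\|_{\mathcal{H}_\theta(X)} \le \|\psi_m\|_{\mathcal{H}_\theta(\mathbb{R}^d)} = C(2k)^{-\nu}\|\psi\|_{\mathcal{H}_{2k\theta}(\mathbb{R}^d)} \le CC'\|\psi\|_{\mathcal{H}_\theta(\mathbb{R}^d)}.$$

Set $C = R / (C'\|\psi\|_{\mathcal{H}_\theta(\mathbb{R}^d)})$, so that $\|\psi_m\|_{\mathcal{H}_\theta(X)} \le R$ for all $m$ and $k$.

Suppose $f = 0$, and let $x_n$ and $x_n^*$ be chosen by any valid strategy $u$.
Set $\chi = \{x_1, \dots, x_{n-1}, x^*_{n-1}\}$, and let $A_m$ be the event that $\psi_m(x) = 0$ for
all $x \in \chi$. There are $n$ points in $\chi$, and the $2n$ functions $\psi_m$ have disjoint
support, so $\sum_m I(A_m) \ge n$. Thus

$$\sum_m \mathbb{P}^u_0(A_m) = \mathbb{E}^u_0\Bigl[\sum_m I(A_m)\Bigr] \ge n,$$

and we have some fixed $m$, depending only on $u$, for which $\mathbb{P}^u_0(A_m) \ge \frac{1}{2}$. On
the event $A_m$,

$$\psi_m(x^*_{n-1}) - \min \psi_m = C(2k)^{-\nu},$$

but on that event, $u$ cannot distinguish between $0$ and $\psi_m$ before time $n$, so

$$C^{-1}(2k)^{\nu}\, \mathbb{E}^u_{\psi_m}[f(x^*_{n-1}) - \min f] \ge \mathbb{P}^u_{\psi_m}(A_m) = \mathbb{P}^u_0(A_m) \ge \tfrac{1}{2}.$$

As the minimax loss is non-increasing in $n$, for $(2(k-1))^d/2 \le n < (2k)^d/2$
we conclude

$$\inf_u L_n(u, \mathcal{H}_\theta(X), R) \ge \inf_u L_{(2k)^d/2-1}(u, \mathcal{H}_\theta(X), R)$$
$$\ge \inf_u \sup_m \mathbb{E}^u_{\psi_m}\bigl[f(x^*_{(2k)^d/2-1}) - \min f\bigr]$$
$$\ge \tfrac{1}{2} C(2k)^{-\nu} = \Omega(n^{-\nu/d}).$$

For general $X$ having non-empty interior, we can find a hypercube $S =
x_0 + [0, \varepsilon]^d \subseteq X$, with $\varepsilon > 0$. We may then proceed as above, picking
functions $\psi_m$ supported inside $S$.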
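The counting step of the lower bound can be illustrated in one dimension. The sketch below is an assumption-laden toy, not the construction used in the proof beyond its scaling: it uses a particular smooth bump with minimum $-1$, takes $C = 1$, and draws the $n$ observation points at random (any $n$ points would do). The names `bump` and `psi_m` are hypothetical.

```python
import math
import random

def bump(x):
    """Smooth bump supported on (0, 1), minimum -1 at x = 1/2, zero elsewhere."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -math.exp(4.0 - 1.0 / (x * (1.0 - x)))

def psi_m(x, m, k, nu, C=1.0):
    """psi_m(x) = C (2k)^{-nu} psi(2k x - m), supported on [m/2k, (m+1)/2k]."""
    return C * (2 * k) ** (-nu) * bump(2 * k * x - m)

k, nu = 8, 1.5
n = (2 * k) // 2                 # d = 1, so n = (2k)^d / 2
random.seed(0)
points = [random.random() for _ in range(n)]

# Bumps vanishing at every observed point cannot be told apart from f = 0.
unseen = [m for m in range(2 * k)
          if all(psi_m(x, m, k, nu) == 0.0 for x in points)]

# Depth of an unseen bump at the centre of its support.
depth = -psi_m((unseen[0] + 0.5) / (2 * k), unseen[0], k, nu)
```

With $2n$ disjoint bumps and only $n$ observation points, at least $n$ bumps are always in `unseen`, and each has depth $(2k)^{-\nu}$, which is of order $n^{-\nu/d}$ when $d = 1$.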

For the upper bound, consider a strategy $u$ choosing a fixed sequence
$x_n$, independent of the $z_n$. Fit a radial basis function interpolant $\hat f_n$ to the
data, and pick $x_n^*$ to minimize $\hat f_n$. Then if $x^*$ minimizes $f$,

$$f(x_n^*) - f(x^*) \le f(x_n^*) - \hat f_n(x_n^*) + \hat f_n(x^*) - f(x^*) \le 2\|\hat f_n - f\|_\infty,$$

so the loss is bounded by the error in $\hat f_n$.

From results in Narcowich et al. (2003, §6) and Wendland (2005, §11.5),
for suitable radial basis functions the error is uniformly bounded by

$$\sup_{\|f\|_{\mathcal{H}_\theta(X)} \le R} \|\hat f_n - f\|_\infty = O(h_n^{\nu}),$$

where the mesh norm

$$h_n := \sup_{x \in X} \min_{i=1}^{n} \|x - x_i\|.$$

(For $\nu \notin \mathbb{N}$, this result is given by Narcowich et al. for the radial basis
function $K_\nu$, which is $\nu$-Hölder at 0 by Abramowitz and Stegun, 1965, §9.6;
for $\nu \in \mathbb{N}$, the result is given by Wendland for thin-plate splines.) As $X$ is
bounded, we may choose the $x_n$ so that $h_n = \Theta(n^{-1/d})$, giving

$$L_n(u, \mathcal{H}_\theta(X), R) = O(n^{-\nu/d}). \qquad □$$

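To prove Theorem 2, we first show that some observations $z_n$ will be
well-predicted by past data.

Lemma 7. Set

$$\beta := \begin{cases} \alpha, & \nu \le 1, \\ 0, & \nu > 1. \end{cases}$$

Given $\theta \in \mathbb{R}^d_+$, there is a constant $C' > 0$ depending only on $X$, $K$ and $\theta$
which satisfies the following. For any $k \in \mathbb{N}$, and sequences $x_n \in X$, $\theta_n \ge \theta$,
the inequality

$$s_n(x_{n+1}; \theta_n) > C' k^{-(\nu \wedge 1)/d} (\log k)^{\beta}$$

holds for at most $k$ distinct $n$.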

Proof. We first show that the posterior variance is bounded by the
distance to the nearest design point. Let $\pi_n$ denote the prior with variance
$\sigma^2 = 1$, and length-scales $\theta_n$. Then for any $i \le n$, as $\hat f_n(x; \theta_n) = \mathbb{E}_{\pi_n}[f(x) \mid \mathcal{F}_n]$,

$$s_n^2(x; \theta_n) = \mathbb{E}_{\pi_n}[(f(x) - \hat f_n(x; \theta_n))^2 \mid \mathcal{F}_n]$$
$$= \mathbb{E}_{\pi_n}[(f(x) - f(x_i))^2 - (f(x_i) - \hat f_n(x; \theta_n))^2 \mid \mathcal{F}_n]$$
$$\le \mathbb{E}_{\pi_n}[(f(x) - f(x_i))^2 \mid \mathcal{F}_n]$$
$$= 2(1 - K_{\theta_n}(x - x_i)).$$

If $\nu \le \frac{1}{2}$, then by assumption

$$|K(x) - K(0)| = O\bigl(\|x\|^{2\nu}(-\log\|x\|)^{2\alpha}\bigr)$$

as $x \to 0$. If $\nu > \frac{1}{2}$, then $K$ is differentiable, so as $K$ is symmetric, $\nabla K(0) = 0$.
If further $\nu \le 1$, then

$$|K(x) - K(0)| = |K(x) - K(0) - x \cdot \nabla K(0)| = O\bigl(\|x\|^{2\nu}(-\log\|x\|)^{2\alpha}\bigr).$$

Similarly, if $\nu > 1$, then $K$ is $C^2$, so

$$|K(x) - K(0)| = |K(x) - K(0) - x \cdot \nabla K(0)| = O(\|x\|^2).$$

We may thus conclude

$$|1 - K(x)| = |K(x) - K(0)| = O\bigl(\|x\|^{2(\nu \wedge 1)}(-\log\|x\|)^{2\beta}\bigr),$$

and

$$s_n^2(x; \theta_n) \le C^2 \|x - x_i\|^{2(\nu \wedge 1)}(-\log\|x - x_i\|)^{2\beta},$$

for a constant $C > 0$ depending only on $X$, $K$ and $\theta$.

We next show that most design points $x_{n+1}$ are close to a previous $x_i$.
$X$ is bounded, so can be covered by $k$ balls of radius $O(k^{-1/d})$. If $x_{n+1}$ lies
in a ball containing some earlier point $x_i$, $i \le n$, then we may conclude

$$s_n^2(x_{n+1}; \theta_n) \le C'^2 k^{-2(\nu \wedge 1)/d}(\log k)^{2\beta}$$

for a constant $C' > 0$ depending only on $X$, $K$ and $\theta$. Hence as there are $k$
balls, at most $k$ points $x_{n+1}$ can satisfy

$$s_n(x_{n+1}; \theta_n) > C' k^{-(\nu \wedge 1)/d}(\log k)^{\beta}. \qquad □$$
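The covering step is a pigeonhole argument, and can be checked directly in one dimension. In the sketch below, which is only an illustration, $X = [0, 1]$ is covered by $k$ intervals of length $1/k$ playing the role of the balls, and we count how many points of an arbitrary sequence lie farther than $1/k$ from every earlier point. The function name `far_points` and the random test sequence are assumptions made for the example.

```python
import random

def far_points(xs, k):
    """Count indices n >= 1 whose point is farther than 1/k from all earlier points."""
    count = 0
    for n in range(1, len(xs)):
        if min(abs(xs[n] - xs[i]) for i in range(n)) > 1.0 / k:
            count += 1
    return count

random.seed(0)
xs = [random.random() for _ in range(500)]       # any sequence in [0, 1] would do
counts = {k: far_points(xs, k) for k in (5, 20, 100)}
```

A point landing in an interval already occupied by an earlier point is within $1/k$ of it, so each count in `counts` is at most $k$, however the sequence is chosen. Combined with the variance bound in terms of the nearest design point, this is what limits the number of times the posterior standard deviation can be large.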

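Next, we provide bounds on the expected improvement when $f$ lies in
the RKHS.

Lemma 8. Let $\|f\|_{\mathcal{H}_\theta(X)} \le R$. For $x \in X$, $n \in \mathbb{N}$, set $I = (f(x_n^*) - f(x))^+$,
and $s = s_n(x; \theta)$. Then for

$$\tau(x) := x\Phi(x) + \varphi(x),$$

we have

$$\max\Bigl(I - Rs,\ \frac{\tau(-R/\sigma)}{\tau(R/\sigma)}\, I\Bigr) \le EI_n(x; \pi) \le I + (R + \sigma)s.$$

Proof. If $s = 0$, then by Lemma 6, $\hat f_n(x; \theta) = f(x)$, so $EI_n(x; \pi) = I$,
and the result is trivial. Suppose $s > 0$, and set $t = (f(x_n^*) - f(x))/s$,
$u = (f(x_n^*) - \hat f_n(x; \theta))/s$. From (8) and (9),

$$EI_n(x; \pi) = \sigma s\, \tau(u/\sigma),$$

and by Lemma 6, $|u - t| \le R$. As $\tau'(z) = \Phi(z) \in [0, 1]$, $\tau$ is non-decreasing,
and $\tau(z) \le 1 + z$ for $z \ge 0$. Hence

$$EI_n(x; \pi) \le \sigma s\, \tau\Bigl(\frac{t^+ + R}{\sigma}\Bigr) \le \sigma s \Bigl(\frac{t^+ + R}{\sigma} + 1\Bigr) = I + (R + \sigma)s.$$

If $I = 0$, then as $EI$ is the expectation of a non-negative quantity,
$EI \ge 0$, and the lower bounds are trivial. Suppose $I > 0$. Then as $EI \ge 0$,
$\tau(z) \ge 0$ for all $z$, and $\tau(z) = z + \tau(-z) \ge z$. Thus

$$EI_n(x; \pi) \ge \sigma s\, \tau\Bigl(\frac{t - R}{\sigma}\Bigr) \ge \sigma s \Bigl(\frac{t - R}{\sigma}\Bigr) = I - Rs.$$

Also, as $\tau$ is increasing,

$$EI_n(x; \pi) \ge \sigma\, \tau\Bigl(\frac{-R}{\sigma}\Bigr) s.$$

Combining these bounds, and eliminating $s$, we obtain

$$EI_n(x; \pi) \ge \frac{\sigma\tau(-R/\sigma)}{R + \sigma\tau(-R/\sigma)}\, I = \frac{\tau(-R/\sigma)}{\tau(R/\sigma)}\, I. \qquad □$$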
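The bounds of Lemma 8 are easy to probe numerically. The following sketch evaluates $\tau(x) = x\Phi(x) + \varphi(x)$ and the expected improvement in the form $\sigma s\, \tau(u/\sigma)$ used in the proof, then checks the two-sided bound on randomly drawn inputs. The sampling ranges, the function names, and the way the posterior mean is perturbed (so that $|u - t| \le R$, as Lemma 6 guarantees) are assumptions made for the illustration.

```python
import math
import random

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def tau(x):
    return x * Phi(x) + phi(x)

def expected_improvement(best, mean, s, sigma):
    """EI = sigma * s * tau(u / sigma), with u = (best - mean) / s and s > 0."""
    u = (best - mean) / s
    return sigma * s * tau(u / sigma)

random.seed(0)
ok = True
for _ in range(1000):
    R, sigma = random.uniform(0.1, 3.0), random.uniform(0.1, 3.0)
    s = random.uniform(0.01, 2.0)
    best, fx = random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)
    mean = fx + random.uniform(-1.0, 1.0) * R * s    # |f(x) - mean| <= R s
    I = max(best - fx, 0.0)
    ei = expected_improvement(best, mean, s, sigma)
    lower = max(I - R * s, tau(-R / sigma) / tau(R / sigma) * I)
    upper = I + (R + sigma) * s
    ok = ok and (lower - 1e-12 <= ei <= upper + 1e-12)
```

After the loop `ok` should remain `True`. The identity $\tau(z) = z + \tau(-z)$, used to simplify the lower bound, can be confirmed the same way, for example by comparing `tau(1.3)` with `1.3 + tau(-1.3)`.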

We may now prove the theorem. We will use the above bounds to show
that there must be times $n_k$ when the expected improvement is low, and
thus $f(x^*_{n_k})$ is close to $\min f$.

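Proof of Theorem 2. From Lemma 7 there exists $C > 0$, depending on $X$,
$K$ and $\theta$, such that for any sequence $x_n \in X$ and $k \in \mathbb{N}$, the inequality

$$s_n(x_{n+1}; \theta) > C k^{-(\nu \wedge 1)/d}(\log k)^{\beta}$$

holds at most $k$ times. Furthermore, $z_n^* - z_{n+1}^* \ge 0$, and for $\|f\|_{\mathcal{H}_\theta(X)} \le R$,

$$\sum_n (z_n^* - z_{n+1}^*) \le z_1^* - \min f \le 2\|f\|_\infty \le 2R,$$

so $z_n^* - z_{n+1}^* > 2Rk^{-1}$ at most $k$ times. Since $z_n^* - f(x_{n+1}) \le z_n^* - z_{n+1}^*$,
we have also $z_n^* - f(x_{n+1}) > 2Rk^{-1}$ at most $k$ times. Thus there is a
time $n_k$, $k \le n_k \le 3k$, for which $s_{n_k}(x_{n_k+1}; \theta) \le C k^{-(\nu \wedge 1)/d}(\log k)^{\beta}$ and

$$z_{n_k}^* - f(x_{n_k+1}) \le 2Rk^{-1}.$$

Let $f$ have minimum $z^*$ at $x^*$. For $k$ large, $x_{n_k+1}$ will have been chosen
by expected improvement (rather than being an initial design point, chosen
at random). Then as $z_n^*$ is non-increasing in $n$, for $3k \le n < 3(k+1)$ we
have by Lemma 8,

$$z_n^* - z^* \le z_{n_k}^* - z^*$$
$$\le \frac{\tau(R/\sigma)}{\tau(-R/\sigma)}\, EI_{n_k}(x^*; \pi)$$
$$\le \frac{\tau(R/\sigma)}{\tau(-R/\sigma)}\, EI_{n_k}(x_{n_k+1}; \pi)$$
$$\le \frac{\tau(R/\sigma)}{\tau(-R/\sigma)} \Bigl(2Rk^{-1} + C(R + \sigma) k^{-(\nu \wedge 1)/d}(\log k)^{\beta}\Bigr).$$

This bound is uniform in $f$ with $\|f\|_{\mathcal{H}_\theta(X)} \le R$, so we obtain

$$L_n(EI(\pi), \mathcal{H}_\theta(X), R) = O\bigl(n^{-(\nu \wedge 1)/d}(\log n)^{\beta}\bigr). \qquad □$$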

A.3 Estimated Parameters

To prove Theorem 3, we first establish lower bounds on the posterior
variance.

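Lemma 9. Given $\theta^L, \theta^U \in \mathbb{R}^d_+$, pick sequences $x_n \in X$, $\theta^L \le \theta_n \le \theta^U$.
Then for open $S \subseteq X$,

$$\sup_{x \in S} s_n(x; \theta_n) = \Omega(n^{-\nu/d}),$$

uniformly in the sequences $x_n$, $\theta_n$.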

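Proof. $S$ is open, so contains a hypercube $T$. For $k \in \mathbb{N}$, let $n = \frac{1}{2}(2k)^d$,
and construct $2n$ functions $\psi_m$ on $T$ with $\|\psi_m\|_{\mathcal{H}_{\theta^U}(X)} \le 1$, as in the proof
of Theorem 1. Let $C^2 = \prod_{i=1}^d (\theta^U_i / \theta^L_i)$; then by Lemma 4, $\|\psi_m\|_{\mathcal{H}_{\theta_n}(X)} \le C$.

Given $n$ design points $x_1, \dots, x_n$, there must be some $\psi_m$ such that
$\psi_m(x_i) = 0$, $1 \le i \le n$. By Lemma 6, the posterior mean of $\psi_m$ given these
observations is the zero function. Thus for $x \in T$ minimizing $\psi_m$,

$$s_n(x; \theta_n) \ge C^{-1} s_n(x; \theta_n) \|\psi_m\|_{\mathcal{H}_{\theta_n}(X)} \ge C^{-1} |\psi_m(x) - 0| = \Omega(k^{-\nu}).$$

As $s_n(x; \theta)$ is non-increasing in $n$, for $\frac{1}{2}(2(k-1))^d < n \le \frac{1}{2}(2k)^d$ we obtain

$$\sup_{x \in S} s_n(x; \theta_n) \ge \sup_{x \in S} s_{\frac{1}{2}(2k)^d}(x; \theta_n) = \Omega(k^{-\nu}) = \Omega(n^{-\nu/d}). \qquad □$$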

Next, we bound the expected improvement when prior parameters are
estimated by maximum likelihood.

Lemma 10. Let $\|f\|_{\mathcal{H}_{\theta^U}(X)} \le R$, $x_n, y_n \in X$. Set $I_n(x) = z_n^* - f(x)$,
$s_n(x) = s_n(x; \hat\theta_n)$, and $t_n(x) = I_n(x)/s_n(x)$. Suppose:

(i) for some $i < j$, $z_i \ne z_j$;

(ii) for some $T_n \to -\infty$, $t_n(x_{n+1}) \le T_n$ whenever $s_n(x_{n+1}) > 0$;

(iii) $I_n(y_{n+1}) \ge 0$; and

(iv) for some $C > 0$, $s_n(y_{n+1}) \ge e^{-C/c_n}$.

Then for $\hat\pi_n$ as in Definition 2, eventually $EI_n(x_{n+1}; \hat\pi_n) < EI_n(y_{n+1}; \hat\pi_n)$.

If the conditions hold on a subsequence, so does the conclusion.

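Proof. Let $\hat R_n^2(\theta)$ be given by (7), and set $\hat R_n^2 = \hat R_n^2(\hat\theta_n)$. For $n \ge j$, $\hat R_n^2 > 0$,
and by Lemma 4 and Corollary 1,

$$\hat R_n^2 \le \|f\|^2_{\mathcal{H}_{\hat\theta_n}(X)} \le S^2 := R^2 \prod_{i=1}^d (\theta^U_i / \theta^L_i).$$

Thus $0 < \hat\sigma_n^2 \le S^2 c_n$. Then if $s_n(x) > 0$, for some $|u_n(x) - t_n(x)| \le S$,

$$EI_n(x; \hat\pi_n) = \hat\sigma_n s_n(x)\, \tau(u_n(x)/\hat\sigma_n),$$

as in the proof of Lemma 8.

If $s_n(x_{n+1}) = 0$, then $x_{n+1} \in \{x_1, \dots, x_n\}$, so

$$EI_n(x_{n+1}; \hat\pi_n) = 0 < EI_n(y_{n+1}; \hat\pi_n).$$

When $s_n(x_{n+1}) > 0$, as $\tau$ is increasing we may upper bound $EI_n(x_{n+1}; \hat\pi_n)$
using $u_n(x_{n+1}) \le T_n + S$, and lower bound $EI_n(y_{n+1}; \hat\pi_n)$ using $u_n(y_{n+1}) \ge -S$.
Since $s_n(x_{n+1}) \le 1$, and $\tau(x) = \Theta(x^{-2} e^{-x^2/2})$ as $x \to -\infty$ (Abramowitz
and Stegun, 1965, §7.1),

$$\frac{EI_n(x_{n+1}; \hat\pi_n)}{EI_n(y_{n+1}; \hat\pi_n)} \le \frac{\tau((T_n + S)/\hat\sigma_n)}{e^{-C/c_n}\, \tau(-S/\hat\sigma_n)}$$
$$= O\bigl((T_n + S)^{-2} e^{C/c_n - (T_n^2 + 2ST_n)/2\hat\sigma_n^2}\bigr)$$
$$= O\bigl((T_n + S)^{-2} e^{-(T_n^2 + 2ST_n - 2CS^2)/2S^2 c_n}\bigr)$$
$$= o(1).$$

If the conditions hold on a subsequence, we may similarly argue along that
subsequence. □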
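The tail behaviour of $\tau$ used in this proof, namely that $\tau(x)$ is of order $x^{-2} e^{-x^2/2}$ as $x \to -\infty$, can be seen numerically. The short sketch below compares $\tau(x)$ with $x^{-2}\varphi(x)$, where $\varphi$ is the standard normal density; the helper name `tail_ratio` and the evaluation points are illustrative choices.

```python
import math

def tau(x):
    """tau(x) = x * Phi(x) + phi(x) for the standard normal Phi and phi."""
    Phi = 0.5 * math.erfc(-x / math.sqrt(2.0))
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return x * Phi + phi

def tail_ratio(x):
    """tau(x) divided by x^{-2} phi(x); approaches 1 as x -> -infinity."""
    return tau(x) * x * x * math.sqrt(2.0 * math.pi) * math.exp(0.5 * x * x)

ratios = {x: tail_ratio(x) for x in (-5.0, -10.0, -20.0)}
```

The ratios increase towards 1 as $x$ decreases, consistent with the classical expansion of the normal tail, so ratios of $\tau$ at two large negative arguments are governed by the Gaussian exponents, as in the display above.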

Finally, we will require the following technical lemma.

Lemma 11. Let $x_1, \dots, x_n$ be random variables taking values in $\mathbb{R}^d$. Given
open $S \subseteq \mathbb{R}^d$, there exist open $U \subseteq S$ for which $\mathbb{P}(\bigcup_{i=1}^n \{x_i \in U\})$ is
arbitrarily small.

Proof. Given $\varepsilon > 0$, fix $m \ge n/\varepsilon$, and pick disjoint open sets $U_1, \dots, U_m \subset S$.
Then

$$\sum_{j=1}^{m} \mathbb{E}[\#\{x_i \in U_j\}] \le n,$$

so there exists $U_j$ with

$$\mathbb{P}\Bigl(\bigcup_{i=1}^n \{x_i \in U_j\}\Bigr) \le \mathbb{E}[\#\{x_i \in U_j\}] \le n/m \le \varepsilon. \qquad □$$

We may now prove the theorem. We will construct a function $f$ on
which the $EI(\hat\pi)$ strategy never observes within a region $W$. We may then
construct a function $g$ agreeing with $f$ except on $W$, but having different
minimum. As the strategy cannot distinguish between $f$ and $g$, it cannot
successfully find the minimum of both.

Proof of Theorem 3. Let the $EI(\hat\pi)$ strategy choose initial design points
$x_1, \dots, x_k$, independently of $f$. Given $\varepsilon > 0$, by Lemma 11 there exists open
$U_0 \subseteq X$ for which $\mathbb{P}^{EI(\hat\pi)}(\bigcup_{i=1}^k \{x_i \in U_0\}) \le \varepsilon$; we may choose $U_0$ so that
$V_0 = X \setminus U_0$ has non-empty interior. Pick open $U_1$ such that $V_1 = \bar U_1 \subset U_0$,
and set $f$ to be a $C^\infty$ function, 0 on $V_0$, 1 on $V_1$, and everywhere
non-negative. By Lemma 1, $f \in \mathcal{H}_{\theta^U}(X)$.

We work conditional on the event $A$, having probability at least $1 - \varepsilon$,
that $z_k^* = 0$, and thus $z_n^* = 0$ for all $n \ge k$. Suppose $x_n \in V_1$ infinitely
often, so the $z_n$ are not all equal. By Lemma 7, $s_n(x_{n+1}; \hat\theta_n) \to 0$, so on a
subsequence with $x_{n+1} \in V_1$, we have

$$t_n = (z_n^* - f(x_{n+1}))/s_n(x_{n+1}; \hat\theta_n) = -s_n(x_{n+1}; \hat\theta_n)^{-1} \to -\infty$$

whenever $s_n(x_{n+1}; \hat\theta_n) > 0$. However, by Lemma 9, there are points $y_{n+1} \in V_0$
with $z_n^* - f(y_{n+1}) = 0$, and $s_n(y_{n+1}; \hat\theta_n) = \Omega(n^{-\nu/d})$. Hence by Lemma 10,
$EI_n(x_{n+1}; \hat\pi_n) < EI_n(y_{n+1}; \hat\pi_n)$ for some $n$, contradicting the definition of
$x_{n+1}$.

Hence, on $A$, there is a random variable $T$ taking values in $\mathbb{N}$, for which
$n > T \implies x_n \notin V_1$. Hence there exists a constant $t \in \mathbb{N}$ for which the
event $B = A \cap \{T \le t\}$ has $\mathbb{P}^{EI(\hat\pi)}_f$-probability at least $1 - 2\varepsilon$. By Lemma 11,
we thus have an open set $W \subset V_1$ for which the event

$$C = B \cap \{x_n \notin W : n \in \mathbb{N}\} = B \cap \{x_n \notin W : n \le t\}$$

has $\mathbb{P}^{EI(\hat\pi)}_f$-probability at least $1 - 3\varepsilon$.

Construct a smooth function $g$ by adding to $f$ a $C^\infty$ function which is 0
outside $W$, and has minimum $-2$. Then $\min g = -1$, but on the event $C$,
$EI(\hat\pi)$ cannot distinguish between $f$ and $g$, and $g(x_n^*) \ge 0$. Thus for $\delta = 1$,

$$\mathbb{P}^{EI(\hat\pi)}_g\Bigl(\inf_n g(x_n^*) - \min g \ge \delta\Bigr) \ge \mathbb{P}^{EI(\hat\pi)}_g(C) = \mathbb{P}^{EI(\hat\pi)}_f(C) \ge 1 - 3\varepsilon.$$

As the behaviour of $EI(\hat\pi)$ is invariant under rescaling, we may scale $g$ to
have norm $\|g\|_{\mathcal{H}_{\theta^U}(X)} \le R$, and the above remains true for some $\delta > 0$. □

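Proof of Theorem 4. As in the proof of Theorem 2, we will show there are
times $n_k$ when the expected improvement is small, so $f(x^*_{n_k})$ must be close
to the minimum. First, however, we must control the estimated parameters
$\hat\sigma_n^2$, $\hat\theta_n$.

If the $z_n$ are all equal, then by assumption the $x_n$ are dense in $X$, so
$f$ is constant, and the result is trivial. Suppose the $z_n$ are not all equal,
and let $T$ be a random variable satisfying $z_T \ne z_i$ for some $i < T$. Set
$U = \inf_{\theta^L \le \theta \le \theta^U} \hat R_T(\theta)$. $\hat R_T(\theta)$ is a continuous positive function, so $U > 0$.
Let $S^2 = R^2 \prod_{i=1}^d (\theta^U_i / \theta^L_i)$. By Lemma 4, $\|f\|_{\mathcal{H}_{\hat\theta_n}(X)} \le S$, so by Corollary 1,
for $n \ge T$,

$$U \le \hat R_T(\hat\theta_n) \le \hat\sigma_n \le \|f\|_{\mathcal{H}_{\hat\theta_n}(X)} \le S.$$

As in the proof of Theorem 2, we have a constant $C > 0$, and some $n_k$,
$k \le n_k \le 3k$, for which $z_{n_k}^* - f(x_{n_k+1}) \le 2Rk^{-1}$ and $s_{n_k}(x_{n_k+1}; \hat\theta_{n_k}) \le
C k^{-(\nu \wedge 1)/d}(\log k)^{\beta}$. Then for $k \ge T$, $3k \le n < 3(k+1)$, arguing as in Theorem 2
we obtain

$$z_n^* - z^* \le z_{n_k}^* - z^*$$
$$\le \frac{\tau(S/\hat\sigma_{n_k})}{\tau(-S/\hat\sigma_{n_k})} \Bigl(2Rk^{-1} + C(S + \hat\sigma_{n_k}) k^{-(\nu \wedge 1)/d}(\log k)^{\beta}\Bigr)$$
$$\le \frac{\tau(S/U)}{\tau(-S/U)} \Bigl(2Rk^{-1} + 2CS k^{-(\nu \wedge 1)/d}(\log k)^{\beta}\Bigr).$$

We thus have a random variable $C'$ satisfying $z_n^* - z^* \le C' n^{-(\nu \wedge 1)/d}(\log n)^{\beta}$
for all $n$, and the result follows. □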

A.4 Near-Optimal Rates

To prove Theorem 5, we first show that the points chosen at random will be
quasi-uniform in $X$.

Lemma 12. Let $x_n$ be i.i.d. random variables, distributed uniformly over
$X$, and define their mesh norm,

$$h_n := \sup_{x \in X} \min_{i=1}^{n} \|x - x_i\|.$$

For any $\gamma > 0$, there exists $C > 0$ such that

$$\mathbb{P}\bigl(h_n > C(n/\log n)^{-1/d}\bigr) = O(n^{-\gamma}).$$

Proof. We will partition $X$ into $n$ regions of size $O(n^{-1/d})$, and show that
with high probability we will place an $x_i$ in each one. Then every point $x$
will be close to an $x_i$, and the mesh norm will be small.

Suppose $X = [0, 1]^d$, fix $k \in \mathbb{N}$, and divide $X$ into $n = k^d$ sub-cubes
$X_m = \frac{1}{k}(m + [0, 1]^d)$, for $m \in \{0, \dots, k-1\}^d$. Let $I_m$ be the indicator
function of the event

$$\{x_i \notin X_m : 1 \le i \le \lfloor \gamma n \log n \rfloor\},$$

and define

$$\mu_n = \mathbb{E}\Bigl[\sum_m I_m\Bigr] = n\,\mathbb{E}[I_0] = n(1 - 1/n)^{\lfloor \gamma n \log n \rfloor} \sim n e^{-\gamma \log n} = n^{-(\gamma - 1)}.$$

For $n$ large, $\mu_n \le 1$, so by the generalized Chernoff bound of Panconesi and
Srinivasan (1997, §3.1),

$$\mathbb{P}\Bigl(\sum_m I_m \ge 1\Bigr) \le e\mu_n \sim e\, n^{-(\gamma - 1)}.$$

On the event $\sum_m I_m < 1$, $I_m = 0$ for all $m$. For any $x \in X$, we then
have $x \in X_m$ for some $m$, and $x_j \in X_m$ for some $1 \le j \le \lfloor \gamma n \log n \rfloor$. Thus

$$\min_{i=1}^{\lfloor \gamma n \log n \rfloor} \|x - x_i\| \le \|x - x_j\| \le \sqrt{d}\, k^{-1}.$$

As this bound is uniform in $x$, we obtain $h_{\lfloor \gamma n \log n \rfloor} \le \sqrt{d}\, k^{-1}$. Thus for
$n = k^d$,

$$\mathbb{P}\bigl(h_{\lfloor \gamma n \log n \rfloor} > \sqrt{d}\, n^{-1/d}\bigr) = O(n^{-(\gamma - 1)}),$$

and as $h_n$ is non-increasing in $n$, this bound holds also for $k^d \le n < (k+1)^d$.
By a change of variables, we then obtain

$$\mathbb{P}\bigl(h_n > C(n/\gamma \log n)^{-1/d}\bigr) = O\bigl((n/\gamma \log n)^{-(\gamma - 1)}\bigr),$$

and the result follows by choosing $\gamma$ large. For general $X$, as $X$ is bounded
it can be partitioned into $n$ regions of measure $O(n^{-1/d})$, so we may argue
similarly. □
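In one dimension the mesh norm of a point set can be computed exactly, which gives a quick sanity check on the rate in Lemma 12. The sketch below is illustrative only: it takes $X = [0, 1]$, a fixed random seed, and compares $h_n$ with $(n/\log n)^{-1/d}$ for $d = 1$; the function name `mesh_norm_1d` is an assumption.

```python
import math
import random

def mesh_norm_1d(points):
    """h_n = sup over x in [0, 1] of the distance from x to the nearest point."""
    xs = sorted(points)
    half_gaps = [(b - a) / 2.0 for a, b in zip(xs, xs[1:])]
    # The farthest location is an endpoint of [0, 1] or the midpoint of a gap.
    return max([xs[0], 1.0 - xs[-1]] + half_gaps)

random.seed(1)
n = 2000
h = mesh_norm_1d([random.random() for _ in range(n)])
rate = math.log(n) / n           # (n / log n)^{-1/d} with d = 1
```

For uniformly drawn points, `h` is typically a small multiple of `rate`, in line with the lemma, while any $n$ points must have mesh norm at least $1/(2n)$. The logarithmic factor is the price of choosing points at random rather than on a grid.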

We may now prove the theorem. We will show that the points $x_n$ must
be quasi-uniform in $X$, so posterior variances must be small. Then, as in the
proofs of Theorems 2 and 4, we have times when the expected improvement
is small, so $f(x_n^*)$ is close to $\min f$.

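Proof of Theorem 5. First suppose $\nu < \infty$. Let the $EI(\,\cdot\,, \varepsilon)$ strategy choose $k$ initial
design points independent of $f$, and suppose $n \ge 2k$. Let $A_n$ be the event
that $\lfloor \frac{\varepsilon}{4} n \rfloor$ of the points $x_{k+1}, \dots, x_n$ are chosen uniformly at random, so by
a Chernoff bound,

$$\mathbb{P}^{EI(\cdot, \varepsilon)}_f(A_n^c) \le e^{-\varepsilon n/16}.$$

Let $B_n$ be the event that one of the points $x_{n+1}, \dots, x_{2n}$ is chosen by
expected improvement, so

$$\mathbb{P}^{EI(\cdot, \varepsilon)}_f(B_n^c) = \varepsilon^n.$$

Finally, let $C_n$ be the event that $A_n$ and $B_n$ occur, and further the mesh
norm $h_n \le C(n/\log n)^{-1/d}$, for the constant $C$ from Lemma 12. Set $r_n =
(n/\log n)^{-\nu/d}(\log n)^{\alpha}$. Then by Lemma 12, since
$C_n^c \subseteq A_n^c \cup B_n^c \cup \{h_n > C(n/\log n)^{-1/d}\}$,

$$\mathbb{P}^{EI(\cdot, \varepsilon)}_f(C_n^c) \le C' r_n,$$

for a constant $C' > 0$ not depending on $f$.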

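Let $EI(\,\cdot\,, \varepsilon)$ have prior $\pi_n$ at time $n$, with (fixed or estimated)
parameters $\sigma_n$, $\theta_n$. Suppose $\|f\|_{\mathcal{H}_{\theta^U}(X)} \le R$, and set $S^2 = R^2 \prod_{i=1}^d (\theta^U_i / \theta^L_i)$, so by
Lemma 4, $\|f\|_{\mathcal{H}_{\theta_n}(X)} \le S$. If $\alpha = 0$, then by Narcowich et al. (2003, §6),

$$\sup_{x \in X} s_n(x; \theta) = O(M(\theta) h_n^{\nu})$$

uniformly in $\theta$, for $M(\theta)$ a continuous function of $\theta$. Hence on the event $C_n$,

$$\sup_{x \in X} s_n(x; \theta_n) \le \sup_{x \in X} \sup_{\theta^L \le \theta \le \theta^U} s_n(x; \theta) \le C'' r_n,$$

for a constant $C'' > 0$ depending only on $X$, $K$, $C$, $\theta^L$ and $\theta^U$. If $\alpha > 0$, the
same result holds by a similar argument.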

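On the event $C_n$, we have some $x_m$ chosen by expected improvement,
$n < m \le 2n$. Let $f$ have minimum $z^*$ at $x^*$. Then by Lemma 8,

$$z_{m-1}^* - z^* \le EI_{m-1}(x^*;\, \cdot\,) + C'' S r_{m-1}$$
$$\le EI_{m-1}(x_m;\, \cdot\,) + C'' S r_{m-1}$$
$$\le (f(x^*_{m-1}) - f(x_m))^+ + C''(2S + \sigma_{m-1}) r_{m-1}$$
$$\le z_{m-1}^* - z_m^* + C'' T r_{m-1},$$

for a constant $T > 0$. (Under $EI(\pi, \varepsilon)$, we have $T = 2S + \sigma$; otherwise
$\sigma_{m-1} \le S$ by Corollary 1, so $T = 3S$.) Thus, rearranging,

$$z_{2n}^* - z^* \le z_m^* - z^* \le C'' T r_n.$$

On the event $C_n^c$, we have $z_{2n}^* - z^* \le 2\|f\|_\infty \le 2R$, so

$$\mathbb{E}^{EI(\cdot, \varepsilon)}_f[z_{2n+1}^* - z^*] \le \mathbb{E}^{EI(\cdot, \varepsilon)}_f[z_{2n}^* - z^*]$$
$$\le 2R\, \mathbb{P}^{EI(\cdot, \varepsilon)}_f(C_n^c) + C'' T r_n$$
$$\le (2C'R + C''T) r_n.$$

As this bound is uniform in $f$ with $\|f\|_{\mathcal{H}_{\theta^U}(X)} \le R$, the result follows. If
instead $\nu = \infty$, the above argument holds for any $\nu < \infty$. □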